mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter) #16574

pockers21 · 2025-10-14T09:04:34Z

Update Notes (2025‑11‑6)

CLI Merge
- Fold the standalone Jina CLI into mtmd-cli’s projector‑only flow; remove the extra binary.
Conversion Script (set_gguf_parameters)
- Emit vision keys using the standard naming: clip.has_vision_encoder, clip.vision.image_size/patch_size/embedding_length/block_count/projection_dim/feed_forward_length/attention.head_count.
- Write only projector_type (set to 'jinaclip2'); do not introduce projector_version.
Inference (mtmd)
- Use ggml_rope_ext to implement 2D RoPE; reuse bicubic for image preprocessing.
Minimal Validation
- Conversion succeeds; gguf_dump shows clip.projector_type='jinaclip2'.
- Minimal inference passes for both text and image; C++ vs Python cosine/RMSE are within the expected range.
Reproduction

Minimal commands & data (CPU)

Produce GGUF (with ST pooling metadata)
- Text: jina-bert-v3.pooling_type = MEAN/CLS/LAST
- Vision: clip.projector_type = jinaclip2, clip.vision.rope_theta = 10000 (default)
Text parity
- C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-embedding -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0 --pooling mean --embd-normalize 2 --embd-output-format array
- Python: python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off
- Metric: read both 512-d outputs and compute cosine / RMSE
Image parity
- C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-mtmd-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0 --embd-normalize 2 --embd-output-format array
- Python: python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off
- Metric: read both 512-d outputs and compute cosine / RMSE
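
The metric step for both parity checks can be sketched in a few lines of Python. This is a minimal illustration, not the debug.py script referenced above: the helper names (l2_normalize, cosine_similarity, rmse) and the toy vectors are assumptions, and loading the two 512-d outputs from disk is left out because the file layout is not shown in this thread. The l2_normalize helper stands in for the Euclidean normalization requested with --embd-normalize 2.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rmse(a, b):
    """Root-mean-square error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Toy stand-ins for the C++ and Python embeddings (real outputs are 512-d).
cpp_embd = l2_normalize([0.6, 0.8, 0.0])
py_embd = l2_normalize([0.6, 0.8, 0.01])

print("cosine:", cosine_similarity(cpp_embd, py_embd))
print("rmse:  ", rmse(cpp_embd, py_embd))
```

Because RMSE depends on vector scale, it is only comparable across runs when both sides are normalized the same way before the comparison.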

mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)

Overview

Converter: write jina-bert-v3 text tower params into GGUF (supports both merged-LoRA checkpoints and adapter-based inputs), and export vision metadata (projector_type=jinaclip, vision.rope_theta, image_size, patch_size, projection_dim, etc.).
Runtime: introduce PROJECTOR_TYPE_JINACLIP in the MTMD path (JinaCLIP v2 vision tower: 2D RoPE with shared frequency cache, attention/FFN internal LayerNorm, single-token output), and normalize with common_embd_normalize(..., 2).
CLI (core): add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks; depends only on common+mtmd+Threads, cross-platform buildable, no third-party deps.
Compatibility: only activates when related GGUF metadata exists; doesn’t affect other projectors (e.g., LLaVA/Qwen2VL); no ggml op changes; no external dependencies.

Scope of changes

convert_hf_to_gguf.py
- Text: support both merged-LoRA single checkpoints and adapter-based export.
- Vision (JinaCLIP v2): export clip.projector_type=jinaclip, clip.vision.rope_theta (configurable), image_size/patch_size/projection_dim, and map tensors for fused/non-fused QKV.
tools/mtmd/clip.cpp, tools/mtmd/clip-impl.h
- Add PROJECTOR_TYPE_JINACLIP: JinaCLIP v2 vision tower (2D RoPE with shared freq cache), attention internal LN, FFN sub-layer LN (enabled when both weight/bias present), single-token output (CLS-equivalent), unified L2 normalize.
- clip_n_output_tokens() returns 1 for JinaCLIP; clip_n_mmproj_embd() returns projection_dim.
tools/mtmd/jinaclip-cli.cpp, tools/mtmd/CMakeLists.txt
- Add llama-jinaclip-cli target (default); one command covers text/image minimal validation, thread scaling, encode_ms reporting, and saves embeddings for Python parity.

Validation summary

CI: CPU-only ci/run.sh passes locally; no ggml op changes in this PR.
Correctness: embedding models have no perplexity; we verify via C++ vs Python parity.
- TEXT (CPU, minimal sample): cosine=0.999996, RMSE=0.000125
- IMAGE (CPU, minimal sample): cosine=0.990261, RMSE=0.006168
Performance: checked with CLI encode_ms and thread scaling; no regression observed. More data can be added if requested.
Compatibility: activated only when GGUF metadata (projector_type=jinaclip, etc.) is present; other projectors unaffected.
Reference: ModelScope uniontech-yourong/split_jina (used for Python-side parity).

Performance (absolute metrics, CPU-only minimal samples)

Environment
- OS: Ubuntu 22.04.5 LTS
- CPU: Intel Xeon Platinum 8352V (dual-socket, 2×32C/64T, SMT on), 128 threads total
- Build: Release, GGML_CUDA=OFF (CPU-only), GCC 11.4, CMake 3.22
- Model: JinaCLIP v2 vision tower (image_size=512, patch=14, depth=24, hidden=1024; official: https://huggingface.co/jinaai/jina-clip-v2); text tower (Jina Embeddings v3, output truncated to 512 dims)
- Threads: primarily 8 threads for both text/image (with 1-thread comparison)
Metric definitions
- Text: use CLI-reported JINACLIP_ENCODE_MS (pure inference, excludes load)
- Image: use CLI line “image … done in … ms” (pure inference, excludes load)
Results (single sample, minimal)
- Text (“hello world”, ≈5 tokens)
  - 1 thread: encode_ms ≈ 180.48 ms
  - 8 threads: encode_ms ≈ 34.08 ms
- Image (512×512, single)
  - 8 threads: image done in ≈ 6154 ms (stabilizes ~6.1–6.4 s after warm-up)
Notes
- Above numbers are CPU-only pure inference; end-to-end (including model load) is higher and not included.

GPU group (absolute metrics, minimal samples)

Environment
- GPU: NVIDIA vGPU-32GB (cc=8.9, 32 GB), Driver 550.107, CUDA 12.4
- Build: Release, GGML_CUDA=ON (CUDA backend), CUDA arch=89
- Threads: -t 8 (host-side preprocessing threads)
Results (pure inference, excludes load)
- Text (“hello world”, ≈5 tokens): encode_ms ≈ 84.88 ms
- Image (512×512, single): image done in ≈ 827 ms

ngxson

add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks;

I don't see why we need to add this new CLI. The mtmd-cli can do this with the -p and --image params.

ngxson · 2025-10-14T10:05:39Z

convert_hf_to_gguf.py

+
+        # Top-level direct mappings
+        if src_no_vm == 'cls_token':
+            return [('v.cls_token', data_torch)]


Use proper mapping instead

ngxson · 2025-10-14T10:09:13Z

tools/mtmd/clip.cpp

+        if (!ctx->jinaclip_rope_initialized) {
+            const int          half_dim = rope_dim / 2;
+            std::vector<float> base_freqs(half_dim);
+            for (int i = 0; i < half_dim; i++) {
+                float arange_val    = i * 2.0f;                     // [0, 2, 4, ..., 30]
+                float normalized    = arange_val / rope_dim;        // [0, 2/32, 4/32, ..., 30/32]
+                float theta_powered = powf(freq_base, normalized);  // theta^normalized
+                base_freqs[i]       = 1.0f / theta_powered;         // 1.0 / theta^normalized
+            }


Not sure what you're trying to do here. Is this just 2D RoPE (which we already support)?

pockers21 · 2025-10-15

This isn't re-implementing generic 2D RoPE; it implements JinaCLIP's VisionRotaryEmbeddingFast.
It uses fractional-position 2D RoPE (t = arange(ft)/ft * pt) and precomputes a full H×W cos/sin grid. The existing 2D RoPE path uses integer grid positions (pos_h/pos_w) with ggml_rope_ext and does not include these steps.
This is done to strictly match Jina's Python semantics.

ngxson · 2025-10-15

fractional-position 2D RoPE (t = arange(ft)/ft * pt)

Based on your code:

time_seq[i] = (float) i / ft_seq_len * pt_seq_len; // [0, 16/36, 32/36, ..., 560/36]
...
freqs_h[t * half_dim + f] = time_seq[t] * base_freqs[f];

Then why don't we scale base_freqs[f] instead? The third param of ggml_rope_ext, the c tensor (freq_scale), is made for this purpose.

Honestly I think this is just YaRN
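
The algebra behind this suggestion can be checked with a short Python sketch. It only demonstrates that multiplying a fractional position i / ft * pt by a base frequency gives the same rotation angle as multiplying the integer position i by a frequency pre-scaled by pt / ft. The values ft = 36, pt = 16, rope_dim = 32, and theta = 10000 are read off the code comments quoted in this thread; the function names are hypothetical, and how such scaling would be passed to ggml_rope_ext is not shown here.

```python
def base_freqs(rope_dim, theta):
    """1 / theta^(2i / rope_dim) for i in [0, rope_dim / 2)."""
    half_dim = rope_dim // 2
    return [1.0 / theta ** (2.0 * i / rope_dim) for i in range(half_dim)]


def angles_fractional_positions(ft, pt, freqs):
    """Angles from fractional positions t = i / ft * pt (the quoted loop)."""
    return [[(i / ft * pt) * f for f in freqs] for i in range(ft)]


def angles_scaled_frequencies(ft, pt, freqs):
    """Angles from integer positions i and frequencies scaled by pt / ft."""
    scale = pt / ft
    return [[i * (f * scale) for f in freqs] for i in range(ft)]


freqs = base_freqs(rope_dim=32, theta=10000.0)
frac = angles_fractional_positions(ft=36, pt=16, freqs=freqs)
scaled = angles_scaled_frequencies(ft=36, pt=16, freqs=freqs)

# Largest absolute difference between the two formulations.
max_diff = max(
    abs(x - y)
    for row_a, row_b in zip(frac, scaled)
    for x, y in zip(row_a, row_b)
)
print("max |difference|:", max_diff)
```

The two grids agree up to floating-point rounding, so the remaining question in the review is whether the existing rope op can express that per-frequency scale, not whether the math differs.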

CISC · 2025-10-28T10:14:17Z

@pockers21 What's up?

pockers21 · 2025-10-29T07:19:34Z

@pockers21 What's up?

I'm currently adjusting the code and fixing issues. I had originally planned to answer your questions all at once when moving the PR from draft to a formal PR, but let me explain now. The link you shared (https://huggingface.co/jinaai/jina-clip-v2/blob/main/config.json#L15-L38) points to the official Jina config, which includes LoRA. In our work, we modified the official Jina model to fuse the text-side LoRA into the base model and then exported the result to GGUF. Under Jina's logic, those fields do not take effect when loading Jina v2; they are only triggered when loading the embeddings v3 model.

CISC · 2025-11-15T16:19:25Z

@pockers21 You need to address the tensor mappings, as pointed out by @ngxson, use tensor_mapping.py where possible.

pockers21 · 2025-11-20T05:37:30Z

@pockers21 You need to address the tensor mappings, as pointed out by @ngxson, use tensor_mapping.py where possible.

Done, please review again.

CISC · 2025-11-20T18:35:56Z

Hmmm, there's a major issue with conversion, the text_config is normally applied on top of the remote jina-embeddings-v3 config.json by transformers, however convert_hf_to_gguf.py has no concept of this when reading the jina-clip-v2 config.json (because trust_remote_code=False).

This means that:

- It thinks that the text model architecture is JinaCLIPModel and fails
- The vision model conversion fails assert self.n_embd_text > 0, "n_embd not found in hparams"

I tried kludging it by copying in values, but I got several other failures, so it's just not working...

pockers21 · 2025-11-21T02:32:13Z

Hmmm, there's a major issue with conversion, the text_config is normally applied on top of the remote jina-embeddings-v3 config.json by transformers, however convert_hf_to_gguf.py has no concept of this when reading the jina-clip-v2 config.json (because trust_remote_code=False).

This means that:

It thinks that the text model architecture is JinaCLIPModel and fails

The vision model conversion fails assert self.n_embd_text > 0, "n_embd not found in hparams"

I tried kludging it by copying in values, but I got several other failures, so it's just not working...

The original Jina model is a single multi-modal checkpoint that contains both text and vision components, and the text side includes a LoRA head. In our workflow, we did two things:

- We split this original model into separate text and vision parts.
- We merged the text LoRA head back into the text encoder weights.
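
For readers unfamiliar with the merge step, here is a minimal Python sketch of folding a LoRA update into a base weight. It assumes the standard LoRA formulation W' = W + (alpha / rank) * B @ A; the thread does not show the script actually used for the split_jina checkpoint, so the function names, shapes, and toy values below are illustrative assumptions only.

```python
def matmul(a, b):
    """Plain nested-list matrix product (rows of a times columns of b)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m, x):
    """Matrix-vector product."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * lora_b @ lora_a."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: W is out x in, A is rank x in, B is out x rank.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[1.0], [3.0]]
merged = merge_lora(W, A, B, alpha=2.0, rank=1)

# The merged weight should reproduce base output plus scaled adapter output.
x = [1.0, 1.0]
adapter_out = [
    base + 2.0 * delta
    for base, delta in zip(matvec(W, x), matvec(B, matvec(A, x)))
]
print(merged, matvec(merged, x), adapter_out)
```

Once merged this way, a checkpoint no longer needs the adapter tensors at load time, which is why a converter can treat it as a plain single-weight model.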

If you want to run conversion, you should follow the layout used here:
https://www.modelscope.cn/models/uniontech-yourong/split_jina/files

Concretely, our implementation assumes that you:

git clone that model repo locally (after opening the page and clicking the “Download model” button, ModelScope will show you the exact clone/download command).

After the download, run:

python3 convert_hf_to_gguf.py ORIG_IMAGE_PATH --outfile out.gguf --mmproj

Here, ORIG_IMAGE_PATH must point to the split_jina/image directory.
In that directory you will see a vision_model_weights.bin file, which is what the converter expects to load for the vision encoder.

pockers21 · 2025-11-27T08:54:11Z

Hmmm, there's a major issue with conversion, the text_config is normally applied on top of the remote jina-embeddings-v3 config.json by transformers, however convert_hf_to_gguf.py has no concept of this when reading the jina-clip-v2 config.json (because trust_remote_code=False).

This means that:

It thinks that the text model architecture is JinaCLIPModel and fails

The vision model conversion fails assert self.n_embd_text > 0, "n_embd not found in hparams"

I tried kludging it by copying in values, but I got several other failures, so it's just not working...

Looking forward to your feedback.

CISC · 2025-11-28T21:41:40Z

TBH, I'm not sure this is acceptable, I would expect to be able to convert the original model, granted it's a little tricky due to the way it's constructed, but should be doable.

It might be acceptable to have a preprocessing script for it, but that's not ideal, @ngxson any opinions?

pockers21 requested review from CISC and ngxson as code owners October 14, 2025 09:04

github-actions bot added the examples and python labels Oct 14, 2025

ngxson requested changes Oct 14, 2025

ngxson reviewed Oct 14, 2025

CISC reviewed Oct 14, 2025

pockers21 force-pushed the feature/jinaclip-v2-projector branch from fd37a5c to 9d02918 Compare October 22, 2025 08:39

pockers21 force-pushed the feature/jinaclip-v2-projector branch from 9d02918 to e19eb27 Compare October 22, 2025 10:35

pockers21 force-pushed the feature/jinaclip-v2-projector branch from e19eb27 to 2d8885b Compare October 22, 2025 10:36

pockers21 force-pushed the feature/jinaclip-v2-projector branch from 2d8885b to b9f78de Compare October 22, 2025 11:07

pockers21 force-pushed the feature/jinaclip-v2-projector branch from b9f78de to 2787888 Compare October 23, 2025 02:07

pockers21 marked this pull request as draft October 24, 2025 05:45

pockers21 force-pushed the feature/jinaclip-v2-projector branch 2 times, most recently from 46f9ee2 to 542ed6a Compare October 28, 2025 03:17

pockers21 force-pushed the feature/jinaclip-v2-projector branch 4 times, most recently from 445e0d5 to bd46020 Compare October 28, 2025 10:02

pockers21 force-pushed the feature/jinaclip-v2-projector branch 2 times, most recently from 7e0b15b to 2338880 Compare October 29, 2025 08:43

pockers21 force-pushed the feature/jinaclip-v2-projector branch from a429271 to e039046 Compare November 14, 2025 13:50

pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 15, 2025

address ggml-org#16574; fold CLI into mtmd-cli; use ggml_rope_ext + bicubic; switch to 'jinaclip2'; fix converter constants

7ea1e82

pockers21 force-pushed the feature/jinaclip-v2-projector branch 5 times, most recently from 9898603 to f5b8651 Compare November 15, 2025 09:33

pockers21 requested a review from CISC November 15, 2025 14:07

pockers21 force-pushed the feature/jinaclip-v2-projector branch 2 times, most recently from eefff38 to a475f81 Compare November 18, 2025 08:10

pockers21 force-pushed the feature/jinaclip-v2-projector branch from a475f81 to ff3cfc0 Compare November 18, 2025 09:30

pockers21 force-pushed the feature/jinaclip-v2-projector branch 2 times, most recently from f86e9fd to 76782e1 Compare November 19, 2025 08:55

pockers21 force-pushed the feature/jinaclip-v2-projector branch from 76782e1 to a2fef90 Compare November 19, 2025 08:58

liyang and others added 4 commits November 20, 2025 09:35

address ggml-org#16574; fold CLI into mtmd-cli; use ggml_rope_ext + bicubic; switch to 'jinaclip2'; fix converter constants

2952542

Simplify Jina BERT v3 detection logic

8461306

Remove unnecessary try/except Jina text hparams. Co-authored-by: Sigbjørn Skjæret

remove unused fused QKV mapping

49c98be

Refactor JinaCLIP vision mmproj mapping to use tensor_mapping table

6617024

pockers21 force-pushed the feature/jinaclip-v2-projector branch from a2fef90 to 6617024 Compare November 20, 2025 01:35

bluebread mentioned this pull request Nov 30, 2025

First DeepSeek-OCR working implementation sfallah/llama.cpp#7

Open
